Expected Policy Gradients

Authors

  • Kamil Ciosek
  • Shimon Whiteson
Abstract

We propose expected policy gradients (EPG), which unify stochastic policy gradients (SPG) and deterministic policy gradients (DPG) for reinforcement learning. Inspired by expected sarsa, EPG integrates across the action when estimating the gradient, instead of relying only on the action in the sampled trajectory. We establish a new general policy gradient theorem, of which the stochastic and deterministic policy gradient theorems are special cases. We also prove that EPG reduces the variance of the gradient estimates without requiring deterministic policies and, for the Gaussian case, with no computational overhead. Finally, we show that it is optimal in a certain sense to explore with a Gaussian policy such that the covariance is proportional to e^H, where H is the scaled Hessian of the critic with respect to the actions. We present empirical results confirming that this new form of exploration substantially outperforms DPG with the Ornstein-Uhlenbeck heuristic in four challenging MuJoCo domains.

Introduction

Policy gradient methods (Sutton et al., 2000; Peters and Schaal, 2006, 2008b; Silver et al., 2014), which optimise policies by gradient ascent, have enjoyed great success in reinforcement learning problems with large or continuous action spaces. The archetypal algorithm optimises an actor, i.e., a policy, by following a policy gradient that is estimated using a critic, i.e., a value function. The policy can be stochastic or deterministic, yielding stochastic policy gradients (SPG) (Sutton et al., 2000) or deterministic policy gradients (DPG) (Silver et al., 2014). The theory underpinning these methods is quite fragmented, as each approach has a separate policy gradient theorem guaranteeing that the policy gradient is unbiased under certain conditions.

Furthermore, both approaches have significant shortcomings. For SPG, variance in the gradient estimates means that many trajectories are usually needed for learning. Since gathering trajectories is typically expensive, there is a great need for more sample-efficient methods. DPG's use of deterministic policies mitigates the problem of variance in the gradient but raises other difficulties. The theoretical support for DPG is limited, since it assumes a critic that approximates ∇_a Q when in practice it approximates Q instead. In addition, DPG learns off-policy (although we show in this paper that, in certain settings, off-policy DPG is equivalent to EPG, our on-policy method), which is undesirable when we want learning to take the cost of exploration into account. More importantly, learning off-policy necessitates designing a suitable exploration policy, which is difficult in practice. In fact, efficient exploration in DPG is an open problem, and most applications simply use independent Gaussian noise or the Ornstein-Uhlenbeck heuristic (Uhlenbeck and Ornstein, 1930; Lillicrap et al., 2015).

In this paper, we propose a new approach called expected policy gradients (EPG) that unifies policy gradients in a way that yields both theoretical and practical insights. Inspired by expected sarsa (Sutton and Barto, 1998; van Seijen et al., 2009), the main idea is to integrate across the action selected by the stochastic policy when estimating the gradient, instead of relying only on the action selected during the sampled trajectory. EPG enables two theoretical contributions.
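The integration step is easiest to see in one dimension. The Python snippet below is a minimal toy sketch, not the paper's implementation: it contrasts a single-sample SPG estimate with an EPG-style estimate that integrates the critic over a Gaussian policy's action distribution using Gauss-Hermite quadrature. The toy critic q_critic, the state, and all policy parameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def q_critic(s, a):
    # Toy critic: a smooth, concave function of the action (assumed for illustration).
    return -(a - 0.5 * s) ** 2 + s

def grad_log_pi_mu(a, mu, sigma):
    # d/d_mu log N(a; mu, sigma^2) = (a - mu) / sigma^2
    return (a - mu) / sigma ** 2

s, mu, sigma = 1.0, 0.0, 0.5   # state and Gaussian policy parameters (illustrative)

# SPG-style estimate: uses only the single action sampled in the trajectory.
a = rng.normal(mu, sigma)
g_spg = q_critic(s, a) * grad_log_pi_mu(a, mu, sigma)

# EPG-style estimate: integrate Q(a|s) * grad log pi(a|s) over the action
# distribution (Gauss-Hermite quadrature for the Gaussian policy).
nodes, weights = np.polynomial.hermite_e.hermegauss(32)
actions = mu + sigma * nodes
g_epg = np.sum(weights / np.sqrt(2.0 * np.pi)
               * q_critic(s, actions) * grad_log_pi_mu(actions, mu, sigma))

print("single-sample SPG estimate:", g_spg)
print("integrated EPG-style estimate:", g_epg)
```

In this toy setting both estimators target the same gradient of the policy mean at the given state, but the integrated estimate no longer depends on which action happened to be sampled, which is the source of the variance reduction the paper establishes.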
First, we establish a number of equivalences between EPG and DPG, among which is a new general policy gradient theorem, of which the stochastic and deterministic policy gradient theorems are special cases. Second, we prove that EPG reduces the variance of the gradient estimates without requiring deterministic policies and, for the Gaussian case, with no computational overhead over SPG.

EPG also enables a practical contribution: a principled exploration strategy for continuous problems. We show that it is optimal in a certain sense to explore with a Gaussian policy such that the covariance is proportional to e^H, where H is the scaled Hessian of the critic with respect to the actions. We present empirical results confirming that this new approach to exploration substantially outperforms DPG with Ornstein-Uhlenbeck exploration in four challenging MuJoCo domains.

Background

A Markov decision process is a tuple (S, A, R, p, p_0, γ), where S is a set of states, A is a set of actions (in practice, either A = R^d or A is finite), R(s, a) is a reward function, p(s′ | a, s) is a transition kernel, p_0 is an initial state distribution, and γ ∈ [0, 1) is a discount factor. A policy π(a | s) is a distribution over actions given a state. We denote trajectories as τ = (s_0, a_0, r_0, s_1, a_1, r_1, . . .), where s_0 ∼ p_0, a_t ∼ π(· | s_t), and r_t is a sample reward. A policy π induces a Markov process with transition kernel p_π(s′ | s) = ∫_a dπ(a | s) p(s′ | a, s), where we use the symbol dπ(a | s) to denote Lebesgue integration against the measure π(a | s) with s fixed. We assume the induced Markov process is ergodic with a single invariant measure defined for the whole state space. The value function is V^π = E_τ[∑_i γ^i r_i], where actions are sampled from π. The Q-function is Q^π(a | s) = E_R[r | s, a] + γ E_{p(s′ | a, s)}[V^π(s′)], and the advantage function is A^π(a | s) = Q^π(a | s) − V^π(s). An optimal policy maximises the total return J = ∫_s dp_0(s) V(s). Since we consider only on-policy learning with just one current policy, we drop the π super/subscript where it is redundant.

If π is parameterised by θ, then stochastic policy gradients (SPG) (Sutton et al., 2000; Peters and Schaal, 2006, 2008b) perform gradient ascent on ∇J, the gradient of J with respect to θ (gradients without a subscript are always with respect to θ). For stochastic policies, the stochastic policy gradient theorem gives

∇J = ∫_s dρ(s) ∫_a dπ(a | s) Q(a | s) ∇ log π(a | s),

where ρ is the discounted-ergodic occupancy measure induced by the policy.
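Returning to the exploration strategy described above (Gaussian exploration with covariance proportional to e^H), the sketch below is a small hypothetical illustration rather than the paper's code: it draws exploratory actions from a Gaussian whose covariance is the matrix exponential of a scaled action-Hessian of the critic. The quadratic toy critic, its coefficients, and the scale sigma0 are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical quadratic critic Q(s, a) = a^T B a + c^T a (coefficients assumed);
# its Hessian with respect to the action is 2 * B, state-independent in this toy.
B = np.array([[-1.5, 0.3],
              [0.3, -0.8]])

def critic_action_hessian(s):
    return 2.0 * B

def explore(s, mean_action, sigma0=0.5):
    # Gaussian exploration with covariance proportional to e^H, where H is the
    # scaled Hessian of the critic with respect to the action.
    H = sigma0 * critic_action_hessian(s)
    w, V = np.linalg.eigh(H)                  # H is symmetric, so eigh applies
    cov = V @ np.diag(np.exp(w)) @ V.T        # matrix exponential e^H (always SPD)
    return rng.multivariate_normal(mean_action, cov)

print("exploratory action:", explore(s=None, mean_action=np.zeros(2)))
```

In this toy example, directions in which the scaled Hessian is strongly negative (the critic is sharply concave) receive small exploration variance under e^H, while flatter directions are explored more broadly.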

Similar articles

Expected Policy Gradients for Reinforcement Learning

We propose expected policy gradients (EPG), which unify stochastic policy gradients (SPG) and deterministic policy gradients (DPG) for reinforcement learning. Inspired by expected sarsa, EPG integrates (or sums) across actions when estimating the gradient, instead of relying only on the action in the sampled trajectory. For continuous action spaces, we first derive a practical result for Gaussi...

Regularized Policy Gradients: Direct Variance Reduction in Policy Gradient Estimation

Policy gradient algorithms are widely used in reinforcement learning problems with continuous action spaces, which update the policy parameters along the steepest direction of the expected return. However, large variance of policy gradient estimation often causes instability of policy update. In this paper, we propose to suppress the variance of gradient estimation by directly employing the var...

Fourier Policy Gradients

We propose a new way of deriving policy gradient updates for reinforcement learning. Our technique, based on Fourier analysis, recasts integrals that arise with expected policy gradients as convolutions and turns them into multiplications. The obtained analytical solutions allow us to capture the low variance benefits of EPG in a broad range of settings. For the critic, we treat trigonometric a...

Particle Value Functions

The policy gradients of the expected return objective can react slowly to rare rewards. Yet, in some cases agents may wish to emphasize the low or high returns regardless of their probability. Borrowing from the economics and control literature, we review the risk-sensitive value function that arises from an exponential utility and illustrate its effects on an example. This risk-sensitive value...

Adaptive Batch Size for Safe Policy Gradients

PROBLEM: monotonically improve a parametric Gaussian policy π_θ in a continuous MDP, avoiding unsafe oscillations in the expected performance J(θ). Episodic policy gradient: estimate ∇̂_θ J(θ) from a batch of N sample trajectories, then update θ′ ← θ + α ∇̂_θ J(θ). The step size α and the batch size N must be tuned to limit oscillations, which is not trivial: α trades off against the speed of convergence (addressed with adaptive methods), and N trades off...


Journal:
  • CoRR

Volume: abs/1706.05374

Pages: -

Publication date: 2017